Abstract

The problem of domain generalization is to take knowledge
acquired from a number of related domains where
training data is available, and to then successfully apply
it to previously unseen domains. We propose a new feature
learning algorithm, Multi-Task Autoencoder (MTAE),
that provides good generalization performance for
cross-domain object recognition.

Our algorithm extends the standard denoising autoencoder
framework by substituting artificially induced corruption
with naturally occurring inter-domain variability in
the appearance of objects. Instead of reconstructing images
from noisy versions, MTAE learns to transform the original
image into analogs in multiple related domains. It thereby
learns features that are robust to variations across domains.
The learnt features are then used as inputs to a classifier.

We evaluated the performance of the algorithm on
benchmark image recognition datasets, where the task is
to learn features from multiple datasets and to then predict
the image label from unseen datasets. We found that
(denoising) MTAE outperforms alternative autoencoder-based
models as well as the current state-of-the-art algorithms for
domain generalization.

1. Introduction 

Recent years have seen dramatic advances in object
recognition by deep learning algorithms [23, U, 32]. Much
of the increased performance derives from applying large
networks to massive labeled datasets such as PASCAL
VOC [14] and ImageNet [22]. Unfortunately, dataset bias
- which can include factors such as backgrounds, camera
viewpoints and illumination - often causes algorithms to
generalize poorly across datasets [35] and significantly
limits their usefulness in practical applications. Developing
algorithms that are invariant to dataset bias is therefore a
compelling problem.

Problem definition. In object recognition, the “visual
world” can be considered as decomposing into views (e.g.
perspectives or lighting conditions) corresponding to
domains. For example, frontal views and 45° rotated views
correspond to two different domains. Alternatively, we can
associate views or domains with standard image datasets
such as PASCAL VOC2007 [14], and Office OP .

The problem of learning from multiple source domains
and testing on unseen target domains is referred to as
domain generalization El 26]. A domain is a probability
distribution from which samples $\{x_i, y_i\}_{i=1}^{n}$ are drawn.
Source domains provide training samples, whereas distinct
target domains are used for testing. In the standard
supervised learning framework, it is assumed that the source and
target domains coincide. Dataset bias becomes a significant
problem when training and test domains differ: applying a
classifier trained on one dataset to images sampled from
another typically results in poor performance [35, 18]. The
goal of this paper is to learn features that improve
generalization performance across domains.

Contribution. The challenge is to build a system that
recognizes objects in previously unseen datasets, given one
or multiple training datasets. We introduce Multi-task
Autoencoder (MTAE), a feature learning algorithm that uses a
multi-task strategy [0|34 | to learn unbiased object features,
where the task is the data reconstruction.

Autoencoders were introduced to address the problem
of ‘backpropagation without a teacher’ by using inputs as
labels - and learning to reconstruct them with minimal
distortion [28, 5]. Denoising autoencoders in particular
are a powerful basic circuit for unsupervised representation
learning [36]. Intuitively, corrupting inputs forces
autoencoders to learn representations that are robust to noise.

This paper proposes a broader view: that autoencoders
are generic circuits for learning invariant features. The
main contribution is a new training strategy based on
naturally occurring transformations such as rotations in
viewing angle, dilations in apparent object size, and shifts in
lighting conditions. The resulting Multi-Task Autoencoder
learns features that are robust to real-world image
variability, and therefore generalize well across domains.
Extensive experiments show that MTAE with a denoising
criterion outperforms the prior state-of-the-art in domain
generalization over various cross-dataset recognition tasks.

2. Related work

Domain generalization has recently attracted attention in
classification tasks, including automatic gating of flow
cytometry data OHO and object recognition [16, EU, 35].
Khosla et al. ED proposed a multi-task max-margin classifier,
which we refer to as Undo-Bias, that explicitly encodes
dataset-specific biases in feature space. These biases are
used to push the dataset-specific weights to be similar to the
global weights. Fang et al. [16] developed Unbiased Metric
Learning (UML) based on a learning-to-rank framework.
Validated on weakly-labeled web images, UML produces a less
biased distance metric that provides good object recognition
performance. More recently, Xu et al. [38] extended an
exemplar-SVM to domain generalization by adding a nuclear
norm-based regularizer that captures the likelihoods of all
positive samples. The proposed model is denoted by LRE-SVM.

Other works in object recognition exist that address a 
similar problem, in the sense of having unknown targets, 
where the unseen dataset contains noisy images that are not 
in the training set 03123 . However, these were designed 
to be noise-specific and may suffer from dataset bias when 
observing objects with different types of noise. 

A closely related task to domain generalization is
domain adaptation, where unlabeled samples from the target
dataset are available during training. Many domain
adaptation algorithms have been proposed for object recognition
(see, e.g., E [31]). Domain adaptation algorithms are not
readily applicable to domain generalization, since no
information is available about the target domain.

Our proposed algorithm is based on the feature learning
approach. Feature learning has been of great interest
in the machine learning community since the emergence of
deep learning (see m and references therein). Some feature
learning methods have been successfully applied to domain
adaptation or transfer learning applications EOS. To the
best of our knowledge, there is no prior work along these
lines on the more difficult problem of domain generalization,
i.e., to create useful representations without observing the
target domain.

3. The Proposed Approach 

Our goal is to learn features that provide good domain
generalization. To do so, we extend the autoencoder [7]
into a model that jointly learns multiple data-reconstruction
tasks taken from related domains. Our strategy is motivated
by prior work demonstrating that learning from multiple
related tasks can improve performance on a novel,
yet related, task - relative to methods trained on a single
task IU[3[[8l|3l.


3.1. Autoencoders 

Autoencoders (AE) have become established as a
pretraining model for deep learning 0 . Autoencoder training
consists of two stages: 1) encoding and 2) decoding.
Given an unlabeled input $x \in \mathbb{R}^{d_x}$, a single hidden layer
autoencoder $f_\Theta(x) : \mathbb{R}^{d_x} \to \mathbb{R}^{d_x}$ can be formulated as

$$
\begin{aligned}
h &= \sigma_{\mathrm{enc}}(W^\top x), \\
\hat{x} &= \sigma_{\mathrm{dec}}(V^\top h) = f_\Theta(x),
\end{aligned}
\qquad (1)
$$

where $W \in \mathbb{R}^{d_x \times d_h}$, $V \in \mathbb{R}^{d_h \times d_x}$ are input-to-hidden
and hidden-to-output connection weights,^1 respectively,
$h \in \mathbb{R}^{d_h}$ is the hidden node vector, and
$\sigma_{\mathrm{enc}}(\cdot) = [s_{\mathrm{enc}}(z_1), \ldots, s_{\mathrm{enc}}(z_{d_h})]^\top$,
$\sigma_{\mathrm{dec}}(\cdot) = [s_{\mathrm{dec}}(z_1), \ldots, s_{\mathrm{dec}}(z_{d_x})]^\top$
are element-wise non-linear activation functions, and
$s_{\mathrm{enc}}$ and $s_{\mathrm{dec}}$ are not necessarily identical.
Popular choices for the activation function $s(\cdot)$ are,
e.g., the sigmoid $s(a) = (1 + \exp(-a))^{-1}$ and the rectified
linear (ReLU) $s(a) = \max(0, a)$.
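
The forward pass in (1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the initialization scale, and the layer sizes (256 inputs, 500 hidden nodes) are assumptions chosen for the example, and bias terms are omitted as in the equations.

```python
import numpy as np

def sigmoid(a):
    """Element-wise logistic sigmoid s(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    """Element-wise rectified linear unit s(a) = max(0, a)."""
    return np.maximum(0.0, a)

def autoencoder_forward(x, W, V, s_enc=sigmoid, s_dec=sigmoid):
    """Single-hidden-layer autoencoder, eq. (1).

    x: input vector of shape (d_x,)
    W: input-to-hidden weights of shape (d_x, d_h)
    V: hidden-to-output weights of shape (d_h, d_x)
    Returns the hidden vector h and the reconstruction x_hat.
    """
    h = s_enc(W.T @ x)        # encoding
    x_hat = s_dec(V.T @ h)    # decoding
    return h, x_hat

# Illustrative sizes and random weights (assumptions, not from the paper).
rng = np.random.default_rng(0)
d_x, d_h = 256, 500
W = 0.01 * rng.standard_normal((d_x, d_h))
V = 0.01 * rng.standard_normal((d_h, d_x))
x = rng.random(d_x)
h, x_hat = autoencoder_forward(x, W, V)
```

Passing `s_dec=relu` or an identity function instead of `sigmoid` gives the case where the encoder and decoder activations differ.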

Let $\Theta = \{W, V\}$ be the autoencoder parameters and
$\{x_i\}_{i=1}^{N}$ be a set of $N$ input data. Learning corresponds to
minimizing the following objective

$$
\hat{\Theta} := \arg\min_{\Theta} \sum_{i=1}^{N} \mathcal{L}(f_\Theta(x_i), x_i) + \eta \mathcal{R}(\Theta), \qquad (2)
$$

where $\mathcal{L}(\cdot, \cdot)$ is the loss function, usually in the form of
least square or cross-entropy loss, and $\mathcal{R}(\cdot)$ is a regularization
term used to avoid overfitting. The objective (2) can be
optimized by the backpropagation algorithm [29]. If we
apply autoencoders to raw pixels of visual object images, the
weights $W$ usually form visually meaningful “filters” that
can be interpreted qualitatively.
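
To make the objective (2) concrete, the sketch below evaluates it for one particular choice of loss and regularizer. The squared-error loss, the squared l2 weight penalty, the sigmoid activations, and the default value of `eta` are assumptions made for illustration; the text above leaves the loss and regularizer generic.

```python
import numpy as np

def ae_objective(X, W, V, eta=3e-4):
    """Value of objective (2) with squared-error loss and an l2 penalty.

    X: data matrix of shape (N, d_x), one input per row
    W: (d_x, d_h) encoder weights, V: (d_h, d_x) decoder weights
    eta: regularization constant (illustrative default)
    """
    H = 1.0 / (1.0 + np.exp(-(X @ W)))      # rows are h_i = s(W^T x_i)
    X_hat = 1.0 / (1.0 + np.exp(-(H @ V)))  # rows are f_Theta(x_i)
    reconstruction = np.sum((X_hat - X) ** 2)
    penalty = np.sum(W ** 2) + np.sum(V ** 2)
    return reconstruction + eta * penalty
```

With all-zero weights every sigmoid output is 0.5, so an input matrix filled with 0.5 yields an objective of exactly zero, which is a convenient sanity check.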

To create a discriminative model using the learnt
autoencoder model, either of the following options can be
considered: 1) the feature map $h(x) := \sigma_{\mathrm{enc}}(W^\top x)$ is extracted
and used as an input to supervised learning algorithms while
keeping the weight matrix $W$ fixed; 2) the learnt weight
matrix $W$ is used to initialize a neural network model and is
updated during the supervised neural network training
(fine-tuning).

Recently, several variants such as denoising
autoencoders (DAE) [37] and contractive autoencoders
(CAE) E3 have been proposed to extract features that are
more robust to small changes of the input. In DAEs, the
objective is to reconstruct a clean input $x$ given its corrupted
counterpart $\tilde{x} \sim Q(\tilde{x} \mid x)$. Commonly used types of
corruption are zero-masking, Gaussian, and salt-and-pepper noise.
Features extracted by DAE have been proven to be more
discriminative than ones extracted by AE [37].
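
Zero-masking corruption is simple to state in code. The following is a hedged sketch: the function name `zero_mask` and its signature are hypothetical, and it implements only the zero-masking case, where each entry is independently set to zero with probability `level`.

```python
import numpy as np

def zero_mask(X, level, rng):
    """Return a corrupted copy of X with a random fraction of entries zeroed.

    X: array of inputs (any shape)
    level: probability in [0, 1] that an individual entry is set to zero
    rng: a numpy.random.Generator used to draw the mask
    """
    keep = rng.random(X.shape) >= level   # True where the entry survives
    return X * keep
```

A denoising autoencoder would feed `zero_mask(X, level, rng)` to the encoder while still using the clean `X` as the reconstruction target.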

1 While the bias terms are incorporated in our experiments, they are 
intentionally omitted from equations for the sake of simplicity. 

Figure 1. The Multi-task Autoencoder (MTAE) architecture,
which consists of three layers with multiple separated outputs;
each output corresponds to one task/domain.


3.2. Multi-task Autoencoders 

We refer to our proposed domain generalization algorithm
as Multi-task Autoencoder (MTAE). From an architectural
viewpoint, MTAE is an autoencoder with multiple
output layers, see Fig. 1. The input-hidden weights represent
shared parameters and the hidden-output weights represent
domain-specific parameters. The architecture is similar
to the supervised multi-task neural networks proposed
by Caruana ED. The main difference is that the output layers
of MTAE correspond to different domains instead of different
class labels.

The most important component of MTAE is the training
strategy, which constructs a generalized denoising autoencoder
that learns invariances to naturally occurring
transformations. Denoising autoencoders focus on the special
case where the transformation is simply noise. In contrast,
MTAE training treats a specific perspective on an object as
the “corrupted” counterpart of another perspective (e.g., a
rotated digit 6 is the noisy pair of the original digit). The
autoencoder objective is then reformulated along the lines of
multi-task learning: the model aims to jointly achieve good
reconstruction of all source views given a particular view.
For example, applying the strategy to handwritten digit
images with several views, MTAE learns representations that
are invariant across the source views, see Section 4.

Two types of reconstruction tasks are performed during
MTAE training: 1) self-domain reconstruction and 2)
between-domain reconstruction. Given $M$ source domains,
there are $M \times M$ reconstruction tasks, of which $M$ tasks are
self-domain reconstructions and the remaining $M \times (M - 1)$
tasks are between-domain reconstructions. Note that the
self-domain reconstruction is identical to the standard
autoencoder reconstruction 0 .

Formal description. Let $\{x_i^l\}_{i=1}^{n_l}$ be a set of
$d_x$-dimensional data points in the $l$-th domain, where
$l \in \{1, \ldots, M\}$. Each domain's data points are combined into
a matrix $X^l \in \mathbb{R}^{n_l \times d_x}$, where $x_i^{l\top}$ is its $i$-th row, such
that $(x_i^1, x_i^2, \ldots, x_i^M)$ form a category-level correspondence.
This configuration enforces the number of samples in a
category to be the same in every domain. Note that such a
configuration is necessary to ensure that the between-domain
reconstruction works (we will discuss how to handle the
case with unbalanced samples in Section 3.3). The input
and output pairs used to train MTAE can then be written as
concatenated matrices

$$
\bar{X} = [X^1; X^2; \ldots; X^M], \qquad
\bar{X}^l = [X^l; X^l; \ldots; X^l], \qquad (3)
$$

where $\bar{X}, \bar{X}^l \in \mathbb{R}^{N \times d_x}$ and $N = \sum_{l=1}^{M} n_l$. In words, $\bar{X}$ is
the matrix of data points taken from all domains and $\bar{X}^l$ is
the matrix of replicated data sets taken from the $l$-th domain.
The replication imposed in $\bar{X}^l$ constructs input-output pairs
for the autoencoder learning algorithm. In practice, the
algorithm can be implemented efficiently - without replicating
the matrix in memory.
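
The construction in (3) amounts to stacking the domain matrices for the input and repeating one domain matrix for the target. The sketch below shows this under the balanced setting described above (every domain has the same number of rows, aligned by category); `build_pairs` is a hypothetical helper name, and the explicit replication is for clarity only.

```python
import numpy as np

def build_pairs(domains, l):
    """Build the input/target matrices of eq. (3) for target domain l.

    domains: list of M arrays, each of shape (n, d_x), with row i of every
             array belonging to the same category (balanced setting).
    l: zero-based index of the domain to reconstruct.
    Returns (X_bar, X_bar_l), both of shape (M * n, d_x).
    """
    M = len(domains)
    X_bar = np.vstack(domains)              # [X^1; X^2; ...; X^M]
    X_bar_l = np.tile(domains[l], (M, 1))   # [X^l; X^l; ...; X^l]
    return X_bar, X_bar_l
```

To avoid holding the replicated target in memory, one could instead index the target row for input row `i` as `domains[l][i % n]`, which is one way to realize the remark about efficient implementation.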

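We now describe MTAE more formally. Let $\bar{x}_i^\top$ and $\bar{x}_i^{l\top}$
be the $i$-th row of matrices $\bar{X}$ and $\bar{X}^l$, respectively. The
feed-forward MTAE reconstruction is

$$
\begin{aligned}
h_i &= \sigma_{\mathrm{enc}}(W^\top \bar{x}_i), \\
f_{\Theta^{(l)}}(\bar{x}_i) &= \sigma_{\mathrm{dec}}(V^{(l)\top} h_i),
\end{aligned}
\qquad (4)
$$

where $\Theta^{(l)} = \{W, V^{(l)}\}$ contains the matrices of shared
and individual weights, respectively.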

The MTAE training is achieved as follows. Let us define
the loss function summed over the datapoints

$$
J(\Theta^{(l)}) = \sum_{i=1}^{N} \mathcal{L}\big(f_{\Theta^{(l)}}(\bar{x}_i), \bar{x}_i^l\big). \qquad (5)
$$

Given $M$ domains, training MTAE corresponds to minimizing
the objective

$$
\hat{\Theta}^{(l)} := \arg\min_{\Theta^{(l)}} \sum_{l=1}^{M} J(\Theta^{(l)}) + \eta \mathcal{R}(\Theta^{(l)}), \qquad (6)
$$

where $\mathcal{R}(\Theta^{(l)})$ is a regularization term. In this work,
we use the standard $\ell_2$-norm weight penalty
$\mathcal{R}(\Theta^{(l)}) = \|W\|_2^2 + \|V^{(l)}\|_2^2$. Stochastic gradient descent is applied on
each reconstruction task to achieve the objective (6). Once
training is completed, the optimal shared weights $\hat{W}$ are
obtained. The stopping criterion is empirically determined
by monitoring the average loss over all reconstruction tasks
during training - the process is stopped when the average
loss stabilizes. The detailed steps of MTAE training are
summarized in Algorithm 1.

The training protocol can be supplemented with a
denoising criterion as in [37] to induce more robust-to-noise
features. To do so, simply replace $\bar{x}_i$ in (4) with its
corrupted counterpart $\tilde{\bar{x}}_i \sim Q(\tilde{\bar{x}}_i \mid \bar{x}_i)$. We name the MTAE
model after applying the denoising criterion the Denoising
Multi-task Autoencoder (D-MTAE).

Algorithm 1 The MTAE feature learning algorithm.

Input:

• Data matrices $\bar{X}$ and $\bar{X}^l$, $\forall l \in \{1, \ldots, M\}$, based on (3);

• Source labels: $\{y_i^l\}_{i=1}^{n_l}$, $\forall l \in \{1, \ldots, M\}$;

• The learning rate: $\alpha$;

1: Initialize $W \in \mathbb{R}^{d_x \times d_h}$ and $V^{(l)} \in \mathbb{R}^{d_h \times d_x}$, $\forall l \in \{1, \ldots, M\}$,
   with small random real values;

2: while not end of epoch do

3:   Do RAND-SEL as described in Section 3.3 to balance the number
     of samples per category in $\bar{X}$ and $\bar{X}^l$;

4:   for $l = 1$ to $M$ do

5:     for each row of $\bar{X}$ do

6:       Do a forward pass based on (4);

7:       Update $W$ and $V^{(l)}$ to achieve the objective (6) with respect
         to the following rules

         $V_{ij}^{(l)} \leftarrow V_{ij}^{(l)} - \alpha \, \partial J(\{W, V^{(l)}\}) / \partial V_{ij}^{(l)}$,

         $W_{ij} \leftarrow W_{ij} - \alpha \, \partial J(\{W, V^{(l)}\}) / \partial W_{ij}$;

8:     end for

9:   end for

10: end while

Output:

• MTAE learnt weights: $\hat{W}$, $\hat{V}^{(l)}$, $\forall l \in \{1, \ldots, M\}$;

3.3. Handling unbalanced samples per category 

MTAE requires that every instance in a particular domain
has a category-level corresponding pair in every other
domain. MTAE’s apparent applicability is therefore limited to
situations where the number of source samples per category
is the same in every domain. However, unbalanced samples
per category occur frequently in applications. To overcome
this issue, we propose a simple random selection procedure
applied in the between-domain reconstructions, denoted by
RAND-SEL, which simply balances the samples per category
while keeping their category-level correspondence.

In detail, the RAND-SEL strategy is as follows. Let $m_c$ be
the number of subsamples in the $c$-th category, where
$m_c = \min(n_{1c}, n_{2c}, \ldots, n_{Mc})$ and $n_{lc}$ is the number of samples
in the $c$-th category of domain $l \in \{1, \ldots, M\}$. For each
category $c$ and each domain $l$, select $m_c$ samples randomly
such that $n_{1c} = n_{2c} = \ldots = n_{Mc} = m_c$. This procedure is
executed in every iteration of the MTAE algorithm, see Line
3 of Algorithm 1.
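
A possible realization of RAND-SEL is sketched below. The function name `rand_sel` and its interface (label arrays in, index arrays out) are assumptions for illustration; categories that are absent from some domain are skipped, which is a choice made here rather than something the text specifies.

```python
import numpy as np

def rand_sel(labels_per_domain, rng):
    """Randomly subsample each domain so that all domains are balanced.

    labels_per_domain: list of M one-dimensional integer label arrays.
    rng: a numpy.random.Generator.
    Returns a list of M index arrays of equal length; position k refers to
    a sample of the same category in every domain.
    """
    categories = np.unique(np.concatenate(labels_per_domain))
    selected = [[] for _ in labels_per_domain]
    for c in categories:
        pools = [np.flatnonzero(y == c) for y in labels_per_domain]
        m_c = min(len(p) for p in pools)     # m_c = min over domains of n_lc
        if m_c == 0:
            continue                         # category missing in some domain
        for sel, pool in zip(selected, pools):
            sel.extend(rng.choice(pool, size=m_c, replace=False).tolist())
    return [np.asarray(s, dtype=int) for s in selected]
```

Calling `rand_sel` afresh at the start of every pass, as Line 3 of the algorithm prescribes, lets different samples from the larger domains take part over time.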


4. Experiments and Results 


We conducted experiments on several real world object
datasets to evaluate the domain generalization ability
of our proposed system. In Section 4.1, we investigate
the behaviour of MTAE in comparison to standard
single-task autoencoder models on raw pixels as
proof-of-principle. In Section 4.2, we evaluate the performance of
MTAE against several state-of-the-art algorithms on modern
object datasets such as the Office ED, Caltech ESI,
PASCAL VOC2007 02 ), LabelMe EH), and SUN09 (M-

4.1. Cross-recognition on the MNIST and ETH-80 
datasets 

In this part, we aim to understand MTAE’s behavior
when learning from multiple domains that form physically
reasonable object transformations such as roll, pitch
rotation, and dilation. The task is to categorize objects in views
(domains) that were not presented during training. We
evaluate MTAE against several autoencoder models. To perform
the evaluation, a variety of object views were constructed
from the MNIST handwritten digit [24] and ETH-80
object [25] datasets.

Data setup. We created four new datasets from MNIST
and ETH-80 images: 1) MNIST-r, 2) MNIST-s, 3) ETH80-p,
and 4) ETH80-y. These new sets contain multiple domains
so that every instance in one domain has a pair in
another domain. The detailed setting for each dataset is as
follows.

MNIST-r contains six domains, each corresponding to
a degree of roll rotation. We randomly chose 1000 digit
images of ten classes from the original MNIST training set
to represent the basic view, i.e., 0 degrees of rotation;^2 each
class has 100 images. Each image was subsampled to a
16 x 16 representation to simplify the computation. This
subset of 1000 images is denoted by $M$. We then created 5
rotated views from $M$ with 15° difference in counterclockwise
direction, denoted by $M_{15^\circ}$, $M_{30^\circ}$, $M_{45^\circ}$, $M_{60^\circ}$, and
$M_{75^\circ}$. The MNIST-s is the counterpart of MNIST-r, where
each domain corresponds to a dilation factor. The views are
denoted by $M$, $M_{*0.9}$, $M_{*0.8}$, $M_{*0.7}$, and $M_{*0.6}$, where the
subscripts represent the dilation factors with respect to $M$.

The ETH80-p consists of eight object classes with 10
subcategories for each class. In each subcategory, there are
41 different views with respect to pose angles. We took
five views from each class denoted by $E_{p0^\circ}$, $E_{p22^\circ}$, $E_{p45^\circ}$,
$E_{p68^\circ}$, and $E_{p90^\circ}$, which represent the horizontal poses, i.e.,
pitch-rotated views starting from the top view to the side
view. This makes the number of instances only 80 for each
view. We then greyscaled and subsampled the images to
28 x 28. The ETH80-y contains five views of the ETH-80
representing the vertical poses, i.e., yaw-rotated views
starting from the right-side view to the left-side view denoted
by $E_{+y90^\circ}$, $E_{+y45^\circ}$, $E_{y0^\circ}$, $E_{-y45^\circ}$, and $E_{-y90^\circ}$. Other
settings such as the image dimensionality and preprocessing
stage are similar to ETH80-p. Examples of the resulting
views are depicted in Fig. 2.

2 Note that the rotation angle of the basic view is not perfectly 0° since
the original MNIST images have varying appearances.

Table 1. The leave-one-domain-out classification accuracies (%) on the MNIST-r and MNIST-s. Bold-red and bold-black indicate the best
and second best performance.

Source | Target | Raw | AE | DAE | CAE | uDICA | MTAE | D-MTAE

MNIST-r leave-one-roll-rotation-out
$M_{15^\circ}, M_{30^\circ}, M_{45^\circ}, M_{60^\circ}, M_{75^\circ}$ | $M$ | 52.40 | 74.20 | 76.90 | 72.10 | 67.20 | 77.90 | 82.50
$M, M_{30^\circ}, M_{45^\circ}, M_{60^\circ}, M_{75^\circ}$ | $M_{15^\circ}$ | 74.10 | 93.20 | 93.20 | 95.30 | 87.80 | 95.70 | 96.30
$M, M_{15^\circ}, M_{45^\circ}, M_{60^\circ}, M_{75^\circ}$ | $M_{30^\circ}$ | 71.40 | 89.90 | 91.30 | 92.60 | 88.80 | 91.20 | 93.40
$M, M_{15^\circ}, M_{30^\circ}, M_{60^\circ}, M_{75^\circ}$ | $M_{45^\circ}$ | 61.40 | 82.20 | 81.10 | 81.50 | 77.80 | 77.30 | 78.60
$M, M_{15^\circ}, M_{30^\circ}, M_{45^\circ}, M_{75^\circ}$ | $M_{60^\circ}$ | 67.40 | 90.00 | 92.80 | 92.70 | 84.20 | 92.40 | 94.20
$M, M_{15^\circ}, M_{30^\circ}, M_{45^\circ}, M_{60^\circ}$ | $M_{75^\circ}$ | 55.40 | 73.80 | 76.50 | 79.30 | 69.50 | 79.90 | 80.50
Average | | 63.68 | 83.88 | 85.30 | 85.58 | 79.22 | 85.73 | 87.58

MNIST-s leave-one-dilation-out
$M_{*0.9}, M_{*0.8}, M_{*0.7}, M_{*0.6}$ | $M$ | 54.00 | 67.50 | 71.80 | 75.80 | 75.80 | 74.50 | 76.00
$M, M_{*0.8}, M_{*0.7}, M_{*0.6}$ | $M_{*0.9}$ | 80.40 | 95.10 | 94.00 | 94.90 | 88.60 | 97.80 | 98.00
$M, M_{*0.9}, M_{*0.7}, M_{*0.6}$ | $M_{*0.8}$ | 82.60 | 94.60 | 92.90 | 94.90 | 86.60 | 96.30 | 96.40
$M, M_{*0.9}, M_{*0.8}, M_{*0.6}$ | $M_{*0.7}$ | 78.20 | 93.70 | 91.60 | 92.50 | 87.40 | 95.80 | 94.90
$M, M_{*0.9}, M_{*0.8}, M_{*0.7}$ | $M_{*0.6}$ | 64.70 | 74.80 | 76.10 | 77.50 | 75.30 | 78.00 | 78.30
Average | | 71.98 | 85.14 | 85.28 | 87.12 | 82.74 | 88.48 | 88.72

Figure 2. Some image examples from the MNIST-r, MNIST-s, and
ETH80-p.

Baselines. We compared the classification performance
of our models with several single-task autoencoder
models. Descriptions of the methods and their hyperparameter
settings are provided below.

• AE 0: the standard autoencoder model trained by
stochastic gradient descent, where all object views
were concatenated as one set of inputs. The number of
hidden nodes was fixed at 500 on the MNIST dataset
and at 1000 on the ETH-80 dataset. The learning rate,
weight decay penalty, and number of iterations were
empirically determined at 0.1, $3 \times 10^{-4}$, and 100,
respectively.

• DAE El: the denoising autoencoder with zero-masking
noise, where all object views were concatenated
as one set of input data. The corruption level
was fixed at 30% for all cases. Other hyper-parameter
values were identical to AE.

• CAE E2: the autoencoder model with the Jacobian
matrix norm regularization referred to as the contractive
autoencoder. The corresponding regularization
constant $\lambda$ was set at 0.1.

• MTAE: our proposed multi-task autoencoder model
with identical hyper-parameter settings as AE, except
for the learning rate set at 0.03, which was also chosen
empirically. This value provides a lower reconstruction
error for each task and visually clearer first layer
weights.

• D-MTAE: MTAE with a denoising criterion. The
learning rate was set the same as MTAE; other
hyper-parameters followed DAE.

We also evaluated the unsupervised Domain-Invariant
Component Analysis (uDICA) [26] on these datasets for
completeness. The hyper-parameters were tuned using
10-fold cross-validation on source domains. We also did
experiments using the supervised variant, DICA, with the
same tuning strategy. Surprisingly, the peak performance of
uDICA is consistently higher than that of DICA. A possible
explanation is that the Dirac kernel function measuring the
label similarity is less appropriate in this application.

We normalized the raw pixels to a range of [0,1] for
autoencoder-based models and $\ell_2$-unit ball for uDICA. We
evaluated the classification accuracies of the learnt features
using multi-class SVM with linear kernel (L-SVM) [12].
Using a linear kernel keeps the classifier simple - since our
main focus is on the feature extraction process. The
LIBLINEAR package El was used to run the L-SVM.

Cross-domain recognition results. We evaluated the object
classification accuracies of each algorithm by
leave-one-domain-out test, i.e., taking one domain as the test
set and the remaining domains as the training set. For all
autoencoder-based algorithms, we repeated the experiments
on each leave-one-domain-out case 30 times and reported
the average accuracies. The standard deviations are not
reported since they are small (±0.1).
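
The leave-one-domain-out protocol can be written as a small generator over domain names. The helper name `leave_one_domain_out` and the string labels used for the domains are illustrative assumptions.

```python
def leave_one_domain_out(domain_names):
    """Yield (source_domains, target_domain) pairs, one per held-out domain."""
    for i, target in enumerate(domain_names):
        sources = domain_names[:i] + domain_names[i + 1:]
        yield sources, target

# Example with the six MNIST-r views (labels chosen for illustration).
mnist_r = ["M", "M15", "M30", "M45", "M60", "M75"]
splits = list(leave_one_domain_out(mnist_r))
```

Each split trains the feature learner and classifier on `sources` only and reports accuracy on `target`, so the six MNIST-r views give six test cases.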

The detailed results on the MNIST-r and MNIST-s can
be seen in Table 1. On average, MTAE has the second
best classification accuracies, and in particular outperforms
single-task autoencoder models. This indicates that the
multi-task feature learning strategy can provide better
discriminative features than the single-task feature learning
w.r.t. unseen object views.

The algorithm with the best performance on these
datasets is D-MTAE. Specifically, D-MTAE performs best
on average and also on 9 out of 11 individual cross-domain
cases of the MNIST-r and MNIST-s. The closest single-task
feature learning competitor to D-MTAE is CAE. This
suggests that the denoising criterion strongly benefits domain
generalization. The denoising criterion is also useful for
single-task feature learning although it does not yield
competitive accuracies, see AE and DAE performance.

We also obtain a consistent trend on the ETH80-p and
ETH80-y datasets, i.e., D-MTAE and MTAE are the best
and second best models. In detail, D-MTAE and MTAE
produce average accuracies of 87.85% and 87.50% on
the ETH80-p, and 97% and 96.50% on the ETH80-y.

Observe that there is an anomaly in the MNIST-r dataset:
the performance on $M_{45^\circ}$ is far worse than on its neighbors
($M_{30^\circ}$, $M_{60^\circ}$). This anomaly appears to be related to the
geometry of the MNIST-r digits. We found that the most
frequently misclassified digits on $M_{45^\circ}$ are 4, 6, and 9
(typically 4 as 9, 6 as 4, and 9 as 8), which rarely occurs on
the other MNIST-r domains. The same phenomenon applies to
L-SVM.


Weight visualization. Useful insight is obtained from
considering the qualitative outcome of the MTAE training
by visualizing the first layer weights. Figure 4 depicts the
weights of some autoencoder models, including ours, on the
MNIST-r dataset. Both MTAE and D-MTAE’s weights form
“filters” that tend to capture the underlying transformation
across the MNIST-r views, which is the rotation. This effect
is unseen in AE and DAE, the filters of which only explain
the contents of handwritten digits in the form of Fourier
component-like descriptors such as local blob detectors and
stroke detectors G 3 . This might be a reason that MTAE
and D-MTAE features can provide better domain
generalization than AE and DAE, since they implicitly capture the
relationship among the source domains.

Next we discuss the difference between MTAE and
D-MTAE filters. The D-MTAE filters not only capture the
object transformation, but also produce features that describe
the object contents more distinctively. These filters
basically combine both properties of the DAE and MTAE filters
that might benefit the domain generalization.

Invariance analysis. A possible explanation for the
effectiveness of MTAE relates to the dimensionality of
the manifold in feature space where samples concentrate.
We hypothesize that if features concentrate near a
low-dimensional submanifold, then the algorithm has found
simple invariant features and will generalize well.

To test the hypothesis, we examine the singular value
spectrum of the Jacobian matrix
$J_x(z) = \left[ \partial z_i / \partial x_j \right]_{ij}$, where
$x$ and $z$ are the input and feature vectors respectively ED.
The spectrum describes the local dimensionality of the
manifold around which samples concentrate. If the spectrum
decays rapidly, then the manifold is locally of low dimension.

Figure 3. The average singular value spectrum of the Jacobian
matrix over the MNIST-r and MNIST-s datasets.

Figure 3 depicts the average singular value spectrum on
test samples from MNIST-r and MNIST-s. The spectrum
of D-MTAE decays the most rapidly, followed by MTAE
and then DAE (with similar rates), and AE decaying the
slowest. The ranking of decay rates of the four algorithms
matches their ranking in terms of empirical performance in
Table 1. Figure 3 thus provides partial confirmation for our
hypothesis. However, a more detailed analysis is necessary
before drawing strong conclusions.
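
For a sigmoid encoder the Jacobian used in this analysis has a closed form, which makes the spectrum easy to compute. The sketch below assumes the feature map z = sigmoid(W^T x) without bias; the function names are hypothetical, and the averaging over samples follows the description of Figure 3 rather than any published code.

```python
import numpy as np

def encoder_jacobian(x, W):
    """Jacobian dz/dx of z = sigmoid(W^T x); shape (d_h, d_x).

    Entry (i, j) equals z_i * (1 - z_i) * W[j, i].
    """
    z = 1.0 / (1.0 + np.exp(-(W.T @ x)))
    return (z * (1.0 - z))[:, None] * W.T

def mean_singular_spectrum(X, W):
    """Average the singular values of the Jacobian over the rows of X."""
    spectra = [np.linalg.svd(encoder_jacobian(x, W), compute_uv=False)
               for x in X]
    return np.mean(spectra, axis=0)
```

Plotting the returned vector against its index for each model gives curves of the kind compared in Figure 3; a faster decay indicates fewer locally sensitive directions.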


4.2. Cross-recognition on the Office, Caltech, 
VOC2007, LabelMe, and SUN09 datasets 

In the second set of experiments, we evaluated the
cross-recognition performance of the proposed algorithms on
modern object datasets. The aim is to show that MTAE
and D-MTAE are applicable and competitive in the more
general setting. We used the Office, Caltech, PASCAL
VOC2007, LabelMe, and SUN09 datasets from which we
formed two cross-domain datasets. Our general strategy is
to extend the generalization of features extracted from the
current best deep convolutional neural network [23].

(d) D-MTAE

Figure 4. The 2D visualization of 100 randomly chosen weights after pretraining on the MNIST-r dataset. Each patch corresponds to a row
of the learnt weight matrix $W$ that represents a “filter”. The weight value $w_{ij} > 3$ is depicted with white, $w_{ij} < -3$ is depicted with
black, otherwise it is gray.

Data Setup. The first cross-domain dataset consists
of images from PASCAL VOC2007 (V), LabelMe (L),
Caltech-101 (C), and SUN09 (S) datasets, each of which
represents one domain. C is an object-centric dataset, while
V, L, and S are scene-centric. This dataset, which we
abbreviate as VLCS, shares five object categories: ‘bird’, ‘car’,
‘chair’, ‘dog’, and ‘person’. Each domain in the VLCS
dataset was divided into a training set (70%) and a test set
(30%) by random selection from the overall dataset. The
detailed training-test configuration for each domain is
summarized in Table 2. Instead of using the raw features
directly, we employed the DeCAF6 features G3 as inputs to
the algorithms. These features have dimensionality of 4,096
and are publicly available.^3

The second cross-domain dataset is referred to as the
Office+Caltech [31, 19] dataset that contains four domains:
Amazon (A), Webcam (W), DSLR (D), and Caltech-256
(C), which share ten common categories. This dataset has
8 to 151 instances per category per domain, and 2,533
instances in total. We also used the DeCAF6 features
extracted from this dataset, which are also publicly available.^4

Table 2. The number of training and test instances for each domain
in the VLCS dataset.

Domain | VOC2007 | LabelMe | Caltech-101 | SUN09
#training | 2,363 | 1,859 | 991 | 2,297
#test | 1,013 | 797 | 424 | 985


Table 3. The groundtruth L-SVM accuracies (%) on the standard
training-test evaluation. The left-most column indicates the
training set, while the upper-most row indicates the test set.

Training/Test | VOC2007 | LabelMe | Caltech-101 | SUN09
VOC2007 | 66.34 | 34.50 | 65.09 | 52.49
LabelMe | 44.03 | 68.76 | 43.87 | 41.02
Caltech-101 | 52.81 | 32.37 | 95.99 | 39.29
SUN09 | 52.42 | 42.03 | 40.33 | 74.21


Training protocol. On these datasets, we utilized the
MTAE or D-MTAE learning as pretraining for a
fully-connected neural network with one hidden layer (1HNN).
The number of hidden nodes was set at 2,000, which is
less than the input dimensionality. In the pretraining stage,
the number of output layers was the same as the number
of source domains - each corresponds to a particular source
domain. The sigmoid activation and linear activation
functions were used for $\sigma_{\mathrm{enc}}(\cdot)$ and $\sigma_{\mathrm{dec}}(\cdot)$, respectively.

The MTAE pretraining was run with a learning rate of
$5 \times 10^{-4}$, 500 epochs, and a batch size of 10, which
were determined empirically by selecting the setting with the
smallest average reconstruction loss. D-MTAE uses the same
hyper-parameter setting as MTAE, with an additional
zero-masking corruption level of 20%. After the pretraining was
completed, we performed back-propagation fine-tuning using 1HNN
with softmax output, where the first-layer weights were
initialized with either the MTAE or the D-MTAE learnt weights.
The supervised learning hyper-parameters were tuned using
10-fold cross validation (10FCV) on the source domains. We
denote the overall models by MTAE+1HNN and D-MTAE+1HNN.
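As an illustration of the D-MTAE input corruption, the sketch below applies zero-masking noise at the 20% level. It assumes that each input feature is zeroed independently with the given probability; masking a fixed fraction of features is another common variant, and the text does not say which one was used. The names `PRETRAIN_CONFIG` and `zero_mask` are hypothetical.

```python
import random

# Pretraining settings as reported in the text.
PRETRAIN_CONFIG = {
    "learning_rate": 5e-4,
    "epochs": 500,
    "batch_size": 10,
    "corruption_level": 0.2,  # D-MTAE only
}

def zero_mask(x, corruption_level, rng=None):
    """Set each feature to zero independently with probability corruption_level
    (assumed Bernoulli masking)."""
    rng = rng or random.Random()
    return [0.0 if rng.random() < corruption_level else xi for xi in x]

rng = random.Random(42)
x = [1.0] * 10000
x_tilde = zero_mask(x, PRETRAIN_CONFIG["corruption_level"], rng)
masked_fraction = x_tilde.count(0.0) / len(x)
```

In a D-MTAE training step, the corrupted vector would be fed to the encoder while the reconstruction targets stay uncorrupted.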

Baselines. We compared our proposed models with six
baselines:

1. L-SVM: an SVM with a linear kernel.

2. 1HNN: a single hidden layer neural network without
pretraining.

3. DAE+1HNN: a two-layer neural network with denoising
autoencoder pretraining.

4. Undo-Bias (Khosla et al.): a multi-task SVM-based algorithm
for undoing dataset bias. Three hyper-parameters
($\lambda, C_1, C_2$) require tuning by 10FCV.

5. UML (Fang et al.): a structural metric learning-based
algorithm that aims to learn a less biased distance metric
for classification tasks. The tuning procedure originally
proposed for this method uses a set of weakly-labeled data
retrieved by querying search engines with the class labels.
Here, however, we tuned the hyper-parameters using the same
strategy as for the other methods (10FCV) for a fair
comparison.

6. LRE-SVM [38]: a non-linear exemplar-SVM model with nuclear
norm regularization to impose a low-rank likelihood matrix.
Four hyper-parameters ($\lambda_1, \lambda_2, C_1, C_2$) were
tuned using 10FCV.


The last three are the state-of-the-art domain generalization
algorithms for object recognition.

3 http://www.cs.dartmouth.edu/chenfang/proj.page/PXR.iccvi3/index.php
4 http://vc.sce.ntu.edu.sg/transfer_learning_domain_adaptation/
Table 4. The cross-recognition accuracy (%) on the VLCS dataset.

Source   Target   L-SVM   1HNN    DAE+1HNN   Undo-Bias   UML     LRE-SVM   MTAE+1HNN   D-MTAE+1HNN
L,C,S    V        58.86   59.10   62.00      54.29       56.26   60.58     61.09       63.90
V,C,S    L        52.49   58.20   59.23      58.09       58.50   59.74     59.24       60.13
V,L,S    C        77.67   86.67   90.24      87.50       91.13   88.11     90.71       89.05
V,L,C    S        49.09   57.86   57.45      54.21       58.49   54.88     60.20       61.33
Avg.              59.93   65.46   67.23      63.52       65.85   65.83     67.81       68.60


Table 5. The cross-recognition accuracy (%) on the Office+Caltech dataset.

Source   Target   L-SVM   1HNN    DAE+1HNN   Undo-Bias   UML     LRE-SVM   MTAE+1HNN   D-MTAE+1HNN
A,C      D,W      82.08   83.41   82.05      80.49       82.29   84.59     84.23       85.35
D,W      A,C      76.12   76.49   79.04      69.98       79.54   81.17     79.30       80.52
C,D,W    A        90.61   92.13   92.02      90.98       91.02   91.87     92.20       93.13
A,W,D    C        84.51   85.89   85.17      85.95       84.59   86.38     85.98       86.15
Avg.              83.33   84.48   84.70      81.85       84.36   86.00     85.43       86.29


We report the performance in terms of the classification
accuracy (%) following Xu et al. [38]. For all algorithms
that are optimized stochastically, we ran 10 independent
training processes using the best-performing hyper-parameters
and reported the average accuracies. Similar to the previous
experiment, we do not report the standard deviations due to
their small values (±0.2).
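The reporting protocol amounts to averaging the accuracy over repeated runs and checking that the spread is small. A minimal sketch follows; the helper name `summarize_runs` is hypothetical, and the ten accuracy values are made-up placeholders for illustration only, not results from the paper.

```python
import statistics

def summarize_runs(accuracies):
    """Return the mean and sample standard deviation of per-run accuracies (%)."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std

# Hypothetical accuracies (%) from 10 independent runs; NOT taken from the paper.
runs = [63.7, 63.9, 64.1, 63.8, 64.0, 63.9, 63.8, 64.2, 63.9, 63.7]
mean_acc, std_acc = summarize_runs(runs)
```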

Results on the VLCS dataset. We first conducted the standard
training-test evaluation using L-SVM, i.e., learning the model
on a training set from one domain and testing it on a test set
from another domain, to check the groundtruth performance and
to identify the existence of dataset bias. The performance is
summarized in Table 3. We can see that the bias indeed exists
in every domain despite the use of DeCAF6, the sixth-layer
features of the state-of-the-art deep convolutional neural
network. The performance gap between the best cross-domain
performance and the groundtruth is large, with > 14% difference.

We then evaluated the domain generalization performance of
each algorithm. We conducted leave-one-domain-out evaluation,
which induces four cross-domain cases. The complete recognition
results are shown in Table 4. In general, the dataset bias can
be reduced by all algorithms after learning from multiple
source domains (compare, e.g., the minimum accuracy over the
first row, with V as the target, in Table 4 with the maximum
cross-recognition accuracy over the VOC2007 column in Table 3).
Furthermore, Caltech-101, which is object-centric, appears to
be the easiest dataset to recognize, consistent with an
investigation in [35]: scene-centric datasets tend to
generalize well over object-centric datasets. Surprisingly,
1HNN already achieves competitive accuracy compared to the
more complicated state-of-the-art algorithms Undo-Bias, UML,
and LRE-SVM. Furthermore, D-MTAE outperforms the other
algorithms on three out of four cross-domain cases and on
average, while MTAE has the second-best performance on average.
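The leave-one-domain-out protocol and the comparison between Tables 3 and 4 mentioned in the parenthesis can be made concrete with a short sketch. The function name `leave_one_domain_out` is hypothetical; the accuracy values are copied from the VOC2007 column of Table 3 and the first row of Table 4.

```python
def leave_one_domain_out(domains):
    """Yield (sources, target) pairs, holding out each domain once as the target."""
    for target in domains:
        sources = [d for d in domains if d != target]
        yield sources, target

domains = ["V", "L", "C", "S"]  # VOC2007, LabelMe, Caltech-101, SUN09
splits = list(leave_one_domain_out(domains))

# Table 3, VOC2007 column: L-SVM trained on a single other domain, tested on V.
single_source_on_v = {"L": 44.03, "C": 52.81, "S": 52.42}

# Table 4, first row: all methods trained on L, C, S and tested on V.
multi_source_on_v = {
    "L-SVM": 58.86, "1HNN": 59.10, "DAE+1HNN": 62.00, "Undo-Bias": 54.29,
    "UML": 56.26, "LRE-SVM": 60.58, "MTAE+1HNN": 61.09, "D-MTAE+1HNN": 63.90,
}

best_single = max(single_source_on_v.values())
worst_multi = min(multi_source_on_v.values())
```

Even the weakest multi-source result on V (`worst_multi`) exceeds the best single-source cross-domain result (`best_single`), which is the comparison the text refers to.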


Results on the Office+Caltech dataset. Table 5 summarizes the
recognition accuracies of each algorithm over four cross-domain
cases on the Office+Caltech dataset. D-MTAE+1HNN has the best
performance on two out of four cross-domain cases and ranks
second on the remaining cases. On average, D-MTAE+1HNN
performs better than the prior state-of-the-art on this
dataset, LRE-SVM [38].
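The ranking statement can be checked directly against Table 5. The sketch below copies the per-case accuracies from the table and computes the rank of D-MTAE+1HNN in each cross-domain case; the names `METHODS`, `TABLE5`, and `rank_of` are hypothetical.

```python
METHODS = ["L-SVM", "1HNN", "DAE+1HNN", "Undo-Bias", "UML",
           "LRE-SVM", "MTAE+1HNN", "D-MTAE+1HNN"]

# (source, target) -> accuracies (%) in the order of METHODS, copied from Table 5.
TABLE5 = {
    ("A,C", "D,W"): [82.08, 83.41, 82.05, 80.49, 82.29, 84.59, 84.23, 85.35],
    ("D,W", "A,C"): [76.12, 76.49, 79.04, 69.98, 79.54, 81.17, 79.30, 80.52],
    ("C,D,W", "A"): [90.61, 92.13, 92.02, 90.98, 91.02, 91.87, 92.20, 93.13],
    ("A,W,D", "C"): [84.51, 85.89, 85.17, 85.95, 84.59, 86.38, 85.98, 86.15],
}

def rank_of(method, accuracies):
    """1-based rank of `method` within one row (1 = highest accuracy)."""
    own = accuracies[METHODS.index(method)]
    return 1 + sum(a > own for a in accuracies)

ranks = {case: rank_of("D-MTAE+1HNN", row) for case, row in TABLE5.items()}
n_wins = sum(r == 1 for r in ranks.values())
```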

5. Conclusions 

We have proposed a new approach to multi-task feature
learning that reduces dataset bias in object recognition. The
main idea is to extract features shared across domains via
a training protocol that, given an image from one domain,
learns to reconstruct analogs of that image for all domains.
The strategy yields two variants: the Multi-task Autoencoder
(MTAE) and the Denoising MTAE (D-MTAE), which incorporates a
denoising criterion. A comprehensive suite of cross-domain
object recognition evaluations shows that the algorithms
successfully learn domain-invariant features, yielding
state-of-the-art performance when predicting the labels
of objects from unseen target domains.

Our results suggest several directions for further study.
Firstly, it is worth investigating whether stacking MTAEs
improves performance. Secondly, more effective procedures
for handling unbalanced samples are required, since
these occur frequently in practice. Finally, a natural
application of MTAEs is to streaming data such as video, where
the appearance of objects transforms in real time.

The problem of dataset bias remains far from solved: the 
best model on the VLCS dataset achieved accuracies less 
than 70% on average. A partial explanation for the poor 
performance compared to supervised learning is insufficient 
training data: the class-overlap across datasets is quite small 
(only 5 classes are shared across VLCS). Further progress 
in domain generalization requires larger datasets. 